Method and system for bit quantization of artificial neural network

ABSTRACT

The present disclosure provides a method for bit quantization of an artificial neural network. The method may comprise: (a) a step of selecting one parameter or one parameter group to be quantized in the artificial neural network; (b) a bit quantization step of reducing the size of the data representation for the selected parameter or parameter group in a unit of bits; (c) a step of determining whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value; and (d) a step of, when the accuracy of the artificial neural network is greater than or equal to the target value, repeating steps (a) to (c).

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/254,039 filed on Dec. 18, 2020, which is a 35 U.S.C. 371 Patent Application of PCT Application No. PCT/KR2020/002559 filed on Feb. 21, 2020, which claims the benefit of Republic of Korea Patent Application No. 10-2019-0067585 filed on Jun. 7, 2019 and Republic of Korea Patent Application No. 10-2019-0022047 filed on Feb. 25, 2019, each of which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

The present disclosure relates to a method and system for bit quantization of an artificial neural network, and more particularly, to a method and system for bit quantization capable of reducing memory usage while maintaining substantial accuracy of an artificial neural network.

BACKGROUND ART

An artificial neural network is a computer structure modeling a biological brain. In an artificial neural network, nodes corresponding to neurons in the brain are interconnected, and the strength of the synaptic coupling between neurons is expressed as a weight. The artificial neural network constructs a model with a given problem-solving ability by changing the strength of the synaptic coupling between nodes through training of the artificial neurons (nodes).

In a narrow sense, the artificial neural network may refer to a multi-layered perceptron, a kind of feedforward neural network; however, it is not limited thereto, and various types of neural networks, such as a radial basis function network, a self-organizing network, and a recurrent neural network, may be included.

Recently, multi-layered deep neural networks have been widely used as a technology for image recognition, and a representative example of a multi-layered deep neural network is the convolutional neural network (CNN). In a general multi-layered feedforward neural network, input data is limited to a one-dimensional form; if image data consisting of two or three dimensions is flattened into one-dimensional data, spatial information is lost, and it can be difficult to train the neural network while maintaining the spatial information of an image. The convolutional neural network, however, can be trained on visual information while maintaining 2D or 3D spatial information.

Specifically, a convolutional neural network is effective in recognizing patterns in visual data because it includes a max-pooling process that effectively recognizes features of adjacent image regions while maintaining spatial information, and collects and reinforces the extracted image features. However, while a deep neural network having a multi-layered structure, such as a convolutional neural network, uses its deep layer structure to provide high recognition performance, its structure is very complex and requires a large amount of computation and a large amount of memory. In a multi-layered deep neural network, most of the internal operations are executed using multiplication and addition or accumulation, and because the number of connections between nodes in the artificial neural network is large and the parameters that require multiplication, e.g., weight data, feature map data, and activation map data, are large, a large amount of computation is required in the training process or recognition process.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

Technical Problems

As discussed above, a large amount of computation and memory is required in the training and recognition processes of a multi-layered deep neural network such as a convolutional neural network. As a method of reducing the amount of computation and memory of a multi-layered deep neural network, a bit quantization method that reduces the data representation size of the parameters used in the computation of the artificial neural network in bit units may be used. Conventional bit quantization uses uniform bit quantization, which quantizes all parameters of an artificial neural network with the same number of bits; this conventional uniform bit quantization method has the problem that it does not accurately reflect the effect that changing the number of bits of each parameter used in an artificial neural network has on the overall performance.

The embodiments disclosed in the present disclosure provide a method and system for quantizing each parameter configuring an artificial neural network, or parameter data grouped according to a specific criterion, to a specific number of bits, so that the accuracy of the artificial intelligence can be maintained while improving the overall performance of the artificial neural network.

Means for Solving the Problems

According to an embodiment of the present disclosure, a method for quantizing bits of an artificial neural network is provided. The method includes the steps of: (a) selecting at least one parameter among a plurality of parameters used in the artificial neural network; (b) bit quantizing to reduce the size of the data required for an operation on the selected parameter in a unit of bits; (c) determining whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value; and (d) if the accuracy of the artificial neural network is greater than or equal to the target value, repeatedly executing steps (b) to (c) for the parameter to further reduce the number of bits in the data representation of the parameter. In addition, the method may further include the step of: (e) if the accuracy of the artificial neural network is less than the target value, restoring the number of bits of the parameter to the number of bits at which the accuracy of the artificial neural network was greater than the target value, and then repeating steps (a) to (d).

According to an embodiment of the present disclosure, a method for quantizing bits of an artificial neural network is provided. The method includes the steps of: (a) selecting at least one of the plurality of layers by a parameter selection module; (b) bit quantizing, by a bit quantization module, to reduce the size of the data representation for a parameter of the selected layer in a unit of bits; (c) determining whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value by an accuracy determination module; and (d) repeating steps (a) to (c) when the accuracy of the artificial neural network is greater than or equal to the target value.

According to an embodiment of the present disclosure, a method for quantizing bits of an artificial neural network is provided. This method includes the steps of: (a) selecting one or more data, or data of one or more groups, among the weights, feature maps, and activation map data of the artificial neural network; (b) bit quantizing, by a bit quantization module, to reduce the data representation size of the selected data in a unit of bits; (c) measuring whether the artificial intelligence accuracy of the artificial neural network is greater than or equal to a target value; and (d) repeating steps (a) to (c) until there is no more data to be quantized among the data of the artificial neural network.

According to an embodiment of the present disclosure, a method for quantizing bits of an artificial neural network is provided. This method includes: training the artificial neural network according to one or more parameters of the artificial neural network; performing bit quantization on one or more parameters of the artificial neural network according to the bit quantization method of the embodiments; and training the artificial neural network according to the one or more parameters of the artificial neural network on which the bit quantization was performed.

According to another embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. The system may include: a parameter selection module for selecting at least one parameter within the artificial neural network; a bit quantization module for reducing the size of the data representation of the selected parameter in a unit of bits; and an accuracy determination module that determines whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value. If the accuracy of the artificial neural network is greater than or equal to the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to execute quantization so that each of the plurality of parameters has a minimum number of bits while maintaining the accuracy of the artificial neural network above the target value.

According to an embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. This system includes: a parameter selection module for selecting at least one layer among a plurality of layers configuring the artificial neural network; a bit quantization module for reducing the size of the data representation for the parameter of the selected layer in a unit of bits; and an accuracy determination module for determining whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value. If the accuracy of the artificial neural network is equal to or greater than the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to perform bit quantization for another layer among the plurality of layers, and the bit quantization module sets n bits (where n is an integer of n>0) for all weights of the plurality of layers, and sets m bits (where m is an integer of m>0) for the output data of the plurality of layers.

According to an embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. The system comprises: a parameter selection module for selecting at least one layer from a plurality of layers configuring the artificial neural network; a bit quantization module for reducing the size of the data representation for the parameter of the selected layer in a unit of bits; and an accuracy determination module that determines whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value, wherein when the accuracy of the artificial neural network is greater than or equal to the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to perform bit quantization for another layer among the plurality of layers, and wherein the bit quantization module allocates n bits (where n is an integer of n>0) to the weights and output data of the plurality of layers, and sets the number of bits allocated to each of the plurality of layers differently.

According to an embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. The system comprises: a parameter selection module for selecting at least one layer from a plurality of layers configuring the artificial neural network; a bit quantization module for reducing the size of the data representation for the parameter of the selected layer in a unit of bits; and an accuracy determination module that determines whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value, wherein when the accuracy of the artificial neural network is greater than or equal to the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to perform bit quantization for another layer among the plurality of layers, and wherein the bit quantization module individually allocates different numbers of bits to the weights and the output data of the plurality of layers.

According to an embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. The system comprises: a parameter selection module for selecting at least one layer from a plurality of layers configuring the artificial neural network; a bit quantization module for reducing the size of the memory for storing the parameter of the selected layer in a unit of bits; and an accuracy determination module that determines whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value, wherein when the accuracy of the artificial neural network is greater than or equal to the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to perform bit quantization for another layer among the plurality of layers, and wherein the bit quantization module allocates a different number of bits for each weight used in the plurality of layers.

According to an embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. The system comprises: a parameter selection module for selecting at least one layer from a plurality of layers configuring the artificial neural network; a bit quantization module for reducing the size of the data representation for the parameter of the selected layer in a unit of bits; and an accuracy determination module that determines whether the accuracy of the artificial neural network is equal to or greater than a predetermined target value, wherein when the accuracy of the artificial neural network is greater than or equal to the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to perform bit quantization for another layer among the plurality of layers, and wherein the bit quantization module individually allocates a different number of bits to a specific unit of the output data output from the plurality of layers.

According to an embodiment of the present disclosure, a system for quantizing bits of an artificial neural network is provided. The system comprises: a parameter selection module for selecting at least one layer from a plurality of layers configuring the artificial neural network; a bit quantization module for reducing the size of the data representation for the parameter of the selected layer in a unit of bits; and an accuracy determination module that determines whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value, wherein when the accuracy of the artificial neural network is greater than or equal to the target value, the accuracy determination module controls the parameter selection module and the bit quantization module to perform bit quantization for another layer among the plurality of layers, and wherein the bit quantization module allocates different bits to individual values of the output data output from the plurality of layers.

Effects of the Present Invention

According to various embodiments of the present disclosure, it is possible to improve overall operation performance by quantizing the number of bits of the data required for an operation such as training or inference in an artificial neural network.

In addition, it is possible to implement an artificial neural network that does not deteriorate the accuracy of the artificial intelligence while reducing the hardware resources required to implement the artificial neural network and reducing power consumption and memory usage.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, where like reference numerals denote like elements, but are not limited thereto.

FIG. 1 is a diagram illustrating an example of an artificial neural network for obtaining output data for input data using a plurality of layers and a plurality of layer weights according to an embodiment of the present disclosure.

FIGS. 2 to 3 are diagrams for explaining specific implementation examples of the artificial neural network shown in FIG. 1 according to an embodiment of the present disclosure.

FIG. 4 is a diagram illustrating another example of an artificial neural network including a plurality of layers according to an embodiment of the present disclosure.

FIG. 5 is a diagram illustrating input data of a convolution layer and a weight kernel used for a convolution operation according to an embodiment of the present disclosure.

FIG. 6 is a diagram illustrating a procedure of generating a first activation map by performing convolution on input data using a first weight kernel according to an embodiment of the present disclosure.

FIG. 7 is a diagram illustrating a procedure of generating a second activation map by performing convolution on input data using a second weight kernel according to an embodiment of the present disclosure.

FIG. 8 is a diagram illustrating the computation process of a convolutional layer as a matrix according to an embodiment of the present disclosure.

FIG. 9 is a diagram illustrating the operation process of a fully connected layer as a matrix according to an embodiment of the present disclosure.

FIG. 10 is a diagram illustrating the bit quantization process of a convolution layer as a matrix according to an embodiment of the present disclosure.

FIG. 11 is a flowchart illustrating a method of quantizing bits of an artificial neural network according to an embodiment of the present disclosure.

FIG. 12 is a flowchart illustrating a method of quantizing bits of an artificial neural network according to another embodiment of the present disclosure.

FIG. 13 is a flowchart illustrating a bit quantization method of an artificial neural network according to still another embodiment of the present disclosure.

FIG. 14 is a graph showing an example of the amount of computation for each layer of an artificial neural network according to an embodiment of the present disclosure.

FIG. 15 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization is performed by a forward bit quantization method according to an embodiment of the present disclosure.

FIG. 16 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization is performed by a backward bit quantization method according to an embodiment of the present disclosure.

FIG. 17 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization is performed by a high computational cost layer first bit quantization method according to an embodiment of the present disclosure.

FIG. 18 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization is performed by a low computational cost layer first bit quantization method according to an embodiment of the present disclosure.

FIG. 19 is a diagram illustrating an example of a hardware implementation of an artificial neural network according to an embodiment of the present disclosure.

FIG. 20 is a diagram illustrating an example of a hardware implementation of an artificial neural network according to another embodiment of the present disclosure.

FIG. 21 is a diagram illustrating an example of a hardware implementation of an artificial neural network according to still another embodiment of the present disclosure.

FIG. 22 is a diagram illustrating the configuration of a system for performing bit quantization on an artificial neural network according to an embodiment of the present disclosure.

MODE FOR CARRYING OUT THE INVENTION

Hereinafter, specific details for carrying out the present disclosure will be described with reference to the accompanying drawings. However, in the following description, detailed descriptions of widely known functions or configurations may be omitted if there is a possibility that they may unnecessarily obscure the subject matter of the present disclosure.

In the accompanying drawings, the same or corresponding elements are assigned the same reference numerals. In addition, in the description of the following embodiments, redundant descriptions of the same or corresponding elements may be omitted. However, even if the description of an element is omitted, this does not mean that such an element is not included in any embodiment.

In the present disclosure, “parameter” may mean one or more of the weight data, feature map data, and activation map data of an artificial neural network or of each layer configuring the artificial neural network. In addition, “parameter” may mean an artificial neural network, or each layer configuring an artificial neural network, expressed by such data. In addition, in the present disclosure, “bit quantization” may mean an operation for reducing the number of bits in the data representation of a parameter or a group of parameters.

The present disclosure provides various embodiments of a quantization method and system for reducing the data representation size of a parameter used in a related operation in a unit of bits, in order to reduce the computational load, memory usage, and power consumption of digital hardware systems. In some embodiments, the bit quantization method and system of the present disclosure may reduce the size of a parameter used in an artificial neural network operation in a unit of bits. In general, data structures of 32-bit, 16-bit, or 8-bit units (for example, in a CPU, GPU, memory, cache, buffer, and the like) are used for the computation of artificial neural networks. Accordingly, the quantization method and system of the present disclosure can reduce the size of a parameter used for computing an artificial neural network to bit widths other than 32, 16, and 8 bits. Moreover, it is possible to individually and differently allocate a specific number of bits to each parameter or group of parameters of the artificial neural network.

In some embodiments, the bit quantization method and system of the present disclosure may set n bits, where n is an integer of n>0, for all weights of an artificial neural network model, and m bits, where m is an integer of m>0, for the output data of each layer.

In another embodiment, the bit quantization method and system of the present disclosure may allocate n bits to the weights and output data of each layer of the artificial neural network model, where n may be set to a different number for each layer.

In still another embodiment, the bit quantization method and system of the present disclosure allocates different bits to the weights and output data of each layer of the artificial neural network model; moreover, a different number of bits may be allocated within each layer to the weights and the output feature map parameters of the corresponding layer.

The bit quantization method and system of the present disclosure can be applied to various kinds of artificial neural networks. For example, when the bit quantization method and system of the present disclosure is applied to a convolutional neural network (CNN), different bits can be individually assigned to the weight kernels used in each layer of the artificial neural network.

In another embodiment, the bit quantization method and system of the present disclosure can allocate different bits for each weight used in each layer of the multi-layered artificial neural network model, allocate individual bits to a specific unit of the output data of each layer, or allocate different bits to individual values of the output data.

The bit quantization method and system according to the various embodiments of the present disclosure described above may apply any one of the above-described embodiments to an artificial neural network model, but is not limited thereto; one or more of these embodiments may be combined and applied to the artificial neural network model.
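By way of illustration only, the per-layer, per-parameter bit allocation described in these embodiments could be recorded in a table such as the following. This is a minimal sketch: the layer names and bit widths are invented for the example, and the disclosure does not prescribe any particular data structure.

```python
# Hypothetical per-layer bit-width table: each layer's weights and output
# (activation) data may each receive their own number of bits, per the
# embodiments above. Names and widths here are illustrative assumptions.
bit_allocation = {
    "conv1": {"weight_bits": 5, "activation_bits": 7},
    "conv2": {"weight_bits": 4, "activation_bits": 6},
    "fc1":   {"weight_bits": 3, "activation_bits": 5},
}
```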

FIG. 1 is a diagram illustrating an example of an artificial neural network 100 that obtains output data for input data using a plurality of layers and a plurality of layer weights according to an embodiment of the present disclosure.

In general, a multi-layered artificial neural network such as the artificial neural network 100 includes a statistical training algorithm implemented based on the structure of a biological neural network in machine learning technology and cognitive science, or a structure that executes such an algorithm. That is, in the artificial neural network 100, as in a biological neural network, nodes, which are artificial neurons forming a network through synaptic connections, repeatedly adjust the weights of the synapses, so that a machine learning model with problem-solving ability can be created by training to reduce the error between the correct output corresponding to a specific input and the inferred output.

In one example, the artificial neural network 100 may be implemented as a multi-layer perceptron (MLP) composed of layers including one or more nodes and connections therebetween. However, the artificial neural network 100 according to the present embodiment is not limited to the structure of the MLP, and may be implemented using any of various artificial neural network structures having a multi-layer structure.

As shown in FIG. 1, when input data is provided from the outside, the artificial neural network 100 is configured to output the output data corresponding to the input data through a plurality of layers 110_1, 110_2, . . . , 110_N, each composed of one or more nodes.

In general, the training methods of the artificial neural network 100 include a supervised learning method that trains the network to be optimized for solving a problem by inputting a teacher signal (correct answer), an unsupervised learning method that does not require a teacher signal, and a semi-supervised learning method that uses supervised and unsupervised learning together. The artificial neural network 100 shown in FIG. 1 may use at least one of the supervised, unsupervised, and semi-supervised learning methods according to the user's selection to train the artificial neural network 100 that generates the output data.

FIGS. 2 to 3 are views for explaining specific implementation examples of the artificial neural network 100 shown in FIG. 1 according to an embodiment of the present disclosure.

Referring to FIG. 2, the artificial neural network 200 may include input nodes (X₀, X₁ . . . X_(n−1), X_(n)) into which the input data 210 is input, output nodes (Y₀, Y₁ . . . Y_(n−1), Y_(n)) that output the output data corresponding to the input data 210, and hidden nodes and multiple parameters located between the input nodes and the output nodes. The input nodes (X₀, X₁ . . . X_(n−1), X_(n)) are nodes configuring the input layer 220 and receive the input data 210, for example, an image, from the outside, and the output nodes (Y₀, Y₁ . . . Y_(n−1), Y_(n)) are nodes configuring the output layer 240 and may output the output data to the outside. The hidden nodes located between the input nodes and the output nodes are nodes configuring the hidden layer 230 and may connect the output data of the input nodes to the input data of the output nodes. Each node of the input layer 220 may be completely connected to each output node of the output layer 240, or incompletely connected, as shown in FIG. 2. In addition, the input nodes may serve to receive input data from the outside and transmit it to the hidden nodes. In this case, the hidden nodes and the output nodes may perform calculations on the data, where the calculation may be performed by multiplying the received input data by a parameter or weight. When the calculation of each node is completed, all calculation results are summed, and then the output data may be output by using a preset activation function.

The hidden nodes and the output nodes (Y₀, Y₁ . . . Y_(n−1), Y_(n)) have an activation function. The activation function may be one among a step function, a sign function, a linear function, a logistic sigmoid function, a hyperbolic tangent function, a ReLU function, and a softmax function. The activation function may be appropriately determined by a skilled person according to the learning method of the artificial neural network.

The artificial neural network 200 performs machine learning by repeatedly updating or modifying the weight values to appropriate values. Representative methods of machine learning by the artificial neural network 200 include supervised learning and unsupervised learning.

Supervised learning is a learning method in which the weight values are updated, in a state where the target output data that the given neural network is to compute for the input data is clearly defined, so that the output data obtained by putting the input data into the neural network becomes close to the target data. The multi-layered artificial neural network 200 of FIG. 2 may be generated based on supervised learning.

Referring to FIG. 3, another example of a multi-layered artificial neural network is the convolutional neural network (CNN) 300, which is a type of deep neural network (DNN). A convolutional neural network (CNN) is a neural network composed of one or several convolutional layers, a pooling layer, and a fully connected layer. The convolutional neural network (CNN) has a structure suitable for training on two-dimensional data, and can be trained through a backpropagation algorithm. It is one of the representative models of DNN widely used in various application fields such as object classification and object detection in images.

Here, it should be noted that the multi-layered artificial neural network of the present disclosure is not limited to the artificial neural networks shown in FIGS. 2 and 3, and a trained model may be obtained by machine learning on other types of data in various other artificial neural networks.

FIG. 4 is a diagram illustrating another example of an artificial neural network including a plurality of layers according to an embodiment of the present disclosure. The artificial neural network 400 shown in FIG. 4 is a convolutional neural network (CNN) including a plurality of convolutional layers (CONV) 420, a plurality of subsampling layers (SUBS) 430, and a plurality of fully connected layers (FC) 440.

The CONV 420 of the CNN 400 generates a feature map by applying a convolution weight kernel to the input data 410. Here, the CONV 420 may serve as a kind of template for extracting features from high-dimensional input data, for example, images or videos. Specifically, one convolution may be repeatedly applied several times while changing its location over a portion of the input data 410 to extract features for the entire input data 410. In addition, the SUBS 430 serves to reduce the spatial resolution of the feature map generated by the CONV 420. Subsampling reduces the dimension of the input data, for example, a feature map, and through this it is possible to reduce the complexity of the analysis problem of the input data 410. The SUBS 430 may use a max pooling operator that takes the maximum value, or an average pooling operator that takes the average value, of the values of a part of the feature map. The SUBS 430 not only reduces the dimension of the feature map through the pooling operation, but also has the effect of making the feature map robust against shift and distortion. Finally, the FC 440 may perform the function of classifying the input data based on the feature map.

The CNN 400 may take on various configurations and functions according to the number of layers of the CONV 420, SUBS 430, and FC 440 or the type of operator. For example, the CNN 400 may adopt any one of various CNN configurations such as AlexNet, VGGNet, LeNet, and ResNet, but is not limited thereto.

When image data is input as the input data 410, the CONV 420 of the CNN 400 having the configuration described above may apply weights to the input data 410 to generate a feature map through a convolution operation; in this case, the group of weights used may be referred to as a weight kernel. The weight kernel is configured as a three-dimensional matrix of n×m×d (here, n represents a row of a specific size of the input image data, m represents a column of a specific size, and d represents a channel of the input image data, each dimension being an integer greater than or equal to 1), and a feature map may be generated through a convolution operation that traverses the input data 410 at specified intervals. At this time, if the input data 410 is a color image having a plurality of channels, for example, the three channels of RGB, the weight kernel may traverse each channel of the input data 410, calculate the convolution, and then generate a feature map for each channel.

FIG. 5 is a diagram illustrating input data of a convolution layer and a weight kernel used for a convolution operation according to an embodiment of the present disclosure.

As illustrated, the input data 510 may be an image or a video displayed as a two-dimensional matrix configured of rows 530 of a specific size and columns 540 of a specific size. As described above, the input data 510 may have a plurality of channels 550, where the channels 550 may represent the number of color components of the input data image. Meanwhile, the weight kernel 520 may be a weight kernel used for convolution to extract the features of a given portion of the input data 510 while scanning that portion. Like the input data image, the weight kernel 520 may be configured to have rows 560 of a specific size, columns 570 of a specific size, and a specific number of channels 580. In general, the sizes of the rows 560 and the columns 570 of the weight kernel 520 are set to be the same, and the number of channels 580 may be the same as the number of channels 550 of the input data image.

FIG. 6 is a diagram illustrating a procedure for generating a first activation map by performing convolution on input data using a first weight kernel according to an embodiment of the present disclosure.

The first weight kernel 610 may be the weight kernel representing the first channel of the weight kernel 520 of FIG. 5. The first weight kernel 610 may finally generate the first activation map 630 by traversing the input data at specified intervals and performing convolution. When the first weight kernel 610 is applied to a part of the input data 510, the convolution is performed by adding up all the values generated by multiplying each of the input data values at a specific position of that part by the values at the corresponding positions of the weight kernel. Through this convolution process, a first result value 620 is generated, and each time the first weight kernel 610 traverses the input data 510, the resulting convolution values form a feature map. Each element value of the feature map is converted into the first activation map 630 through the activation function of the convolutional layer.

FIG. 7 is a diagram illustrating a procedure of generating a second activation map by performing convolution on input data using a second weight kernel according to an embodiment of the present disclosure.

As shown in FIG. 6, after performing convolution on the input data 510 using the first weight kernel 610 to generate the first activation map 630, the second activation map 730 may be generated by performing convolution on the input data 510 using the second weight kernel 710, as shown in FIG. 7.

The second weight kernel 710 may be the weight kernel representing the second channel of the weight kernel 520 of FIG. 5. The second weight kernel 710 may finally generate the second activation map 730 by traversing the input data at specified intervals and performing convolution. As shown in FIG. 6, when the second weight kernel 710 is applied to a part of the input data 510, the convolution is performed by adding up all the values generated by multiplying each of the input data values at a specific position of that part by the values at the corresponding positions of the weight kernel. Through this convolution process, a second result value 720 is generated, and each time the second weight kernel 710 traverses the input data 510, the resulting convolution values form a feature map. Each element value of the feature map is converted into the second activation map 730 through the activation function of the convolutional layer.

FIG. 8 is a diagram illustrating the computation process of a convolutional layer as a matrix when the input feature map has one channel according to an embodiment of the present disclosure.

The convolution layer 420 illustrated in FIG. 8 may correspond to the CONV 420 illustrated in FIG. 4. In FIG. 8, the input data 810 input to the convolution layer 420 is displayed as a two-dimensional matrix having a size of 6×6, and the weight kernel 814 is displayed as a two-dimensional matrix having a size of 3×3. However, the sizes of the input data 810 and the weight kernel 814 of the convolution layer 420 are not limited thereto, and may be variously changed according to the performance and requirements of the artificial neural network including the convolution layer 420.

As illustrated, when the input data 810 is input to the convolution layer 420, the weight kernel 814 traverses the input data 810 at a predetermined interval, for example, 1, so that an elementwise multiplication, in which the values at the same positions of the input data 810 and the weight kernel 814 are multiplied, can be performed. The weight kernel 814 traverses the input data 810 at regular intervals and sums 816 the values obtained through the elementwise multiplication.

Specifically, the weight kernel 814 assigns the value of the elementwise multiplication, for example, “3”, calculated at a specific location 820 of the input data 810 to the corresponding element 824 of the feature map 818. Next, the weight kernel 814 assigns the value of the elementwise multiplication, for example, “1”, calculated at the next position 822 of the input data 810 to the corresponding element 826 of the feature map 818. In this way, when the weight kernel 814 traverses the input data 810 and assigns the calculated elementwise multiplication values to the feature map 818, the feature map 818 having a size of 4×4 is completed. At this time, if the input data 810 is composed of, for example, three channels (R channel, G channel, B channel), feature maps for each channel may be generated through convolution in which the same weight kernel, or different weight kernels for each channel, traverse the data of each channel of the input data 810 and perform elementwise multiplication 812 and summation 816.
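For illustration only, the traversal just described can be restated as a short Python routine using NumPy. This is a sketch assuming a single-channel input and a stride of 1, not a definitive implementation of the disclosed layer; with the 6×6 input and 3×3 kernel of FIG. 8 it produces the 4×4 feature map described above.

```python
import numpy as np

def conv2d_single_channel(input_data: np.ndarray, kernel: np.ndarray,
                          stride: int = 1) -> np.ndarray:
    """Slide the kernel over the input, multiplying elementwise and
    summing at each position, as in the traversal of FIG. 8."""
    kh, kw = kernel.shape
    out_h = (input_data.shape[0] - kh) // stride + 1
    out_w = (input_data.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = input_data[i * stride:i * stride + kh,
                                j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(window * kernel)  # multiply, then sum
    return feature_map

# A 6x6 input and a 3x3 kernel at stride 1 yield a 4x4 feature map.
x = np.arange(36).reshape(6, 6).astype(float)
w = np.ones((3, 3))
print(conv2d_single_channel(x, w).shape)  # (4, 4)
```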

Referring back to FIG. 4, the CONV 420 may generate an activation map, which is the final output result of the convolution layer, by applying an activation function to the feature map generated according to the method described with reference to FIGS. 2 to 8. Here, the activation function may be any one of various activation functions, such as a sigmoid function, a radial basis function (RBF), or a rectified linear unit (ReLU), a modified function thereof, or another function.

Meanwhile, the SUBS 430 receives the activation map, which is the output data of the CONV 420, as its input data. The SUBS 430 performs the function of reducing the size of the activation map or highlighting specific data. When the SUBS 430 uses max pooling, the maximum value within a specific area of the activation map is selected and output. In this way, noise in the input data can be removed through the pooling process of the SUBS 430, and the size of the data can be reduced.
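As a minimal sketch of this max pooling behavior, assuming a 2×2 window and a stride of 2 (values the text does not fix):

```python
import numpy as np

def max_pool2d(activation_map: np.ndarray, window: int = 2,
               stride: int = 2) -> np.ndarray:
    """Select the maximum value within each window of the activation map,
    reducing its spatial size as described for the SUBS 430."""
    out_h = (activation_map.shape[0] - window) // stride + 1
    out_w = (activation_map.shape[1] - window) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = activation_map[i * stride:i * stride + window,
                                    j * stride:j * stride + window]
            pooled[i, j] = region.max()  # keep only the strongest response
    return pooled
```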

In addition, the FC 440 may receive the output data of the SUBS 430 and generate the final output data 450. The activation map extracted from the SUBS 430 is one-dimensionally flattened to be input to the fully connected layer 440.

FIG. 9 is a diagram illustrating the operation process of a fully connected layer as a matrix according to an embodiment of the present disclosure.

The fully connected layer 440 shown in FIG. 9 may correspond to the FC 440 of FIG. 4. As described above, the activation map extracted from the max pooling layer 430 may be flattened into one dimension to be input to the fully connected layer 440. The activation map flattened into one dimension may be received as the input data 910 by the fully connected layer 440. In the fully connected layer 440, an elementwise multiplication 912 of the input data 910 and the weight kernel 914 may be performed using the one-dimensional weight kernel 914. The results of the elementwise multiplication of the input data 910 and the weight kernel 914 may be summed 916 and output as the output data 918. In this case, the output data 918 may represent an inference value for the input data 410 input to the CNN 400.
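Continuing the illustrative sketch, the operation of FIG. 9 reduces to flattening followed by an elementwise multiplication and a summation; the 4×4 map size and the random values below are assumptions made only for the example.

```python
import numpy as np

# The flattened activation map becomes a 1-D input vector; multiplying it
# elementwise by the 1-D weight kernel and summing yields the scalar
# output described for FIG. 9.
activation_map = np.random.rand(4, 4)         # pooled map (size assumed)
input_vector = activation_map.flatten()       # one-dimensional flattening
weights = np.random.rand(input_vector.size)   # one-dimensional weight kernel
output = np.sum(input_vector * weights)       # elementwise multiply, then sum
```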

The CNN 400, having the above-described configuration, receives input data as a two-dimensional or one-dimensional matrix for each of a plurality of layers, and performs training and inference on the input data through complex operations such as elementwise multiplication and summation with weight kernels. Accordingly, depending on the number of layers configuring the CNN 400 or the complexity of the operations, the resources required for data training and inference, for example, the number of operators or the amount of memory, may increase considerably. Accordingly, in order to reduce the amount of computation and memory of an artificial neural network having a plurality of layers, such as the CNN 400, bit quantization may be performed on the input and output data used in each layer. In one embodiment, bit quantization of the CNN 400 having a plurality of layers may be performed for the CONV 420 and the FC 440, which require a large amount of computation and memory.

FIG. 10 is a diagram illustrating the bit quantization process of a convolution layer as a matrix according to an embodiment of the present disclosure.

The bit quantization performed in the convolutional layer may include weight or weight kernel quantization 1028 for reducing the number of bits of each element value of the weight kernel used in the convolution operation, and/or feature map or activation map quantization 1030 for reducing the number of bits of each element value of the feature map or activation map.

The bit quantization process of the convolutional layer according to an embodiment may be performed as follows. Before performing convolution by applying the weight kernel 1014 to the input data 1010 of the convolution layer, a quantization process 1016 is performed on the weight kernel 1014 to generate the quantized weight kernel 1018. Then, by applying the quantized weight kernel 1018 to the input data 1010 and executing elementwise multiplication 1012 and summation 1020 to output the convolution values, a feature map is generated, from which an activation map 1022 may be generated by an activation function. Next, the final quantized activation map 1026 may be generated through quantization 1024 of the activation map.

In the bit quantization process of the convolution layer described above, the weight kernel quantization 1028 may be performed using the following equation.

$a_{q} = \text{quantization}(a_{f}) = \frac{1}{2^{k}} \times \text{round}(2^{k} \times a_{f})$

Where a_(f) is the weight value to be quantized, for example, a real-number weight such as each weight in the weight kernel, k represents the number of bits to quantize to, and a_(q) represents the result of a_(f) being quantized to k bits. That is, according to the above formula, first, a_(f) is multiplied by the predetermined binary number 2^(k), so that a_(f) is shifted up by k bits; this is hereinafter referred to as “the first value”. Next, by performing a rounding or truncation operation on the first value, the digits after the decimal point are removed; this is hereinafter referred to as “the second value”. The second value is divided by the binary number 2^(k), shifting the value back down by k bits, so that the element value of the final quantized weight kernel can be calculated. This weight or weight kernel quantization 1028 is repeatedly executed for all element values of the weight or weight kernel 1014 to generate the quantized weight kernel 1018.
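To make the round-and-rescale step concrete, here is a minimal NumPy sketch of the stated formula; the function name and the sample values are illustrative only.

```python
import numpy as np

def quantize(a_f: np.ndarray, k: int) -> np.ndarray:
    """Quantize to k bits per the stated formula:
    a_q = round(2**k * a_f) / 2**k."""
    return np.round((2 ** k) * a_f) / (2 ** k)

weight_kernel = np.array([0.731, -0.442, 0.118])
print(quantize(weight_kernel, 4))  # each value snapped to a multiple of 1/16
```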

Meanwhile, the feature map or activation map quantization 1030 may be performed by the following equations.

$a_{f} = \text{clip}(a_{f}, -1, 1)$

$a_{q} = \text{quantization}(a_{f}) = \frac{1}{2^{k}} \times \text{round}(2^{k} \times a_{f})$

In the feature map or activation map quantization 1030, the same formula as in the weight or weight kernel quantization 1028 may be used. However, in feature map or activation map quantization, a process of normalizing each element value of the feature map or activation map 1022 to a value between 0 and 1 can be added by applying the clipping to each element value a_(f), for example, a real number, of the feature map or activation map before the quantization is applied.

Next, the normalized a_(f) is multiplied by the predetermined binary number 2^(k), so that a_(f) is shifted up by k bits; this is hereinafter referred to as “the first value”. Next, by performing a rounding or truncation operation on the first value, the digits after the decimal point of a_(f) are removed; this is hereinafter referred to as “the second value”. The second value is divided by the binary number 2^(k), shifting the value back down by k bits, so that the element values of the final quantized feature map or activation map 1026 may be calculated. This quantization 1030 of the feature map or activation map is repeatedly executed for all the element values of the feature map or activation map 1022 to generate the quantized feature map or activation map 1026.
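A corresponding sketch for the activation path, assuming the clip range (−1, 1) given in the formula above; details beyond the two stated equations are not specified by the text.

```python
import numpy as np

def quantize_activation(a_f: np.ndarray, k: int) -> np.ndarray:
    """Clip each activation value into [-1, 1], then quantize to k bits
    with the same round-and-rescale formula as the weight quantization."""
    clipped = np.clip(a_f, -1.0, 1.0)
    return np.round((2 ** k) * clipped) / (2 ** k)
```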

Through the weight or weight kernel quantization 1028 and the feature map or activation map quantization 1030 described above, the memory size and the amount of computation required for the convolution operation of the convolutional layer 420 of the convolutional neural network can be reduced in a unit of bits.

FIG. 11 is a flowchart illustrating a method of quantizing bits of an artificial neural network according to an embodiment of the present disclosure. This embodiment is an example in which the unit of a data group that can be quantized in an artificial neural network is assumed to be all parameters belonging to each layer configuring the artificial neural network.

As shown, the bit quantization method 1100 of the artificial neural network may be initiated by selecting at least one layer from the plurality of layers included in the artificial neural network, in step S1110. The layer to be selected from the plurality of layers included in the artificial neural network may be determined according to the influence of that layer on the overall performance, amount of computation, or amount of memory of the artificial neural network. In an embodiment, in the multi-layered artificial neural network described above with reference to FIGS. 1 to 3, a layer having a large influence on the overall performance or computational amount of the artificial neural network may be arbitrarily selected. In addition, in the case of the convolutional neural network (CNN) 400 described with reference to FIGS. 4 to 10, since the convolutional layer 420 and/or the fully connected layer 440 has a large effect on the overall performance or computational amount of the CNN 400, at least one of these layers 420 and 440 may be selected.

A method of selecting at least one of the plurality of layers included in the artificial neural network may be determined according to the influence of the selected layer on the overall performance or computational amount of the artificial neural network. However, the present disclosure is not limited thereto, and any of various methods may be used. For example, the selection of at least one layer from the plurality of layers included in the artificial neural network may be performed according to: (i) a method of sequentially selecting layers, from the first layer at which the input data is received to the subsequent layers, according to the arrangement order of the plurality of layers configuring the artificial neural network; (ii) a method of sequentially selecting layers, from the last layer at which the final output data is generated to the previous layers, according to the arrangement order of the plurality of layers configuring the artificial neural network; (iii) a method of selecting the layer with the highest computational amount among the plurality of layers configuring the artificial neural network; or (iv) a method of selecting the layer with the lowest computational amount among the plurality of layers configuring the artificial neural network.

When the layer selection of the artificial neural network is completed in step S1110, the operation may proceed to step S1120 of reducing the data representation size of a parameter of the selected layer, for example, a weight, in a unit of bits.

In one embodiment, when the size of the weights or the output data among the parameters of the selected layer is reduced in a unit of bits, the weight kernel quantization 1028 and the activation map quantization 1024 described with reference to FIGS. 4 to 10 may be performed. For example, the weight kernel quantization 1028 may be calculated by the following equation.

$a_{q} = \text{quantization}(a_{f}) = \frac{1}{2^{k}} \times \text{round}(2^{k} \times a_{f})$

Here, a_(f) denotes an element value of the weight kernel to be quantized, for example, a real-number weight kernel coefficient, k denotes the number of bits to be quantized to, and a_(q) denotes the result of a_(f) being quantized to k bits. That is, according to the above equation, first, a_(f) is multiplied by the predetermined binary number 2^(k), so that a_(f) is shifted up by k bits, i.e., “the first value”. Next, by performing a rounding or truncation operation on the first value, the digits after the decimal point of a_(f) are removed, i.e., “the second value”. The second value is divided by the binary number 2^(k), shifting the value back down by k bits, so that the element value of the final quantized weight kernel can be calculated. This weight kernel quantization 1028 is repeatedly executed for all the element values of the weight kernel 1014 to generate the quantized weight kernel 1018.

Meanwhile, the activation map quantization 1030 may be performed by the following equations.

$a_{f} = \text{clip}(a_{f}, -1, 1)$

$a_{q} = \text{quantization}(a_{f}) = \frac{1}{2^{k}} \times \text{round}(2^{k} \times a_{f})$

In the activation map quantization 1030, the clipping is applied to each element value a_(f), for example, a real-number coefficient, of the activation map 1022 before the quantization is applied, so a process of normalizing each element value of the activation map 1022 to a value between 0 and 1 may be added. Next, the normalized a_(f) is multiplied by the predetermined binary number 2^(k), so that a_(f) is shifted up by k bits, i.e., “the first value”. Next, by performing a rounding or truncation operation on the first value, the digits after the decimal point of a_(f) are removed, i.e., “the second value”. The second value is divided by the binary number 2^(k), shifting the value back down by k bits, so that the element value of the final quantized activation map 1026 may be calculated. This quantization 1030 of the activation map is repeatedly executed for all the element values of the activation map 1022, generating the quantized activation map 1026.

In the above-described embodiments, an example of reducing the number of bits of the weight values or the activation map data has been described in order to reduce the size of the data representation for the parameters of the layer selected in the artificial neural network. However, the bit quantization method of the present disclosure is not limited thereto. In another embodiment, different bits can be allocated for data in the intermediate stages that exist between the multiple computational steps for the various data included in the selected layer. Accordingly, in order to reduce the size of the memory, for example, a buffer, register, or cache, in which each piece of data is stored in a hardware implementation of the artificial neural network, the number of bits of each piece of data stored in the corresponding memory may be reduced, and the width of the corresponding memory may be decreased. In still another embodiment, the data bit width of a data path through which the data of the layer selected in the artificial neural network is transmitted may be reduced in bit units.

After the execution of step S1120, step S1130 of determining whether the accuracy of the artificial neural network is equal to or greater than a predetermined target value may proceed. If the accuracy of the output result of the artificial neural network, for example, the training result or the inference result, is equal to or greater than the predetermined target value after the data representation size of the parameter of the selected layer has been reduced in a unit of bits, it can be expected that the overall performance of the artificial neural network can be maintained even if the bits of the data are reduced further.

Accordingly, when it is determined in step S1130 that the accuracy of the artificial neural network is greater than or equal to the target value, the process proceeds to step S1120 to further reduce the data representation size of the selected layer in a unit of bits.

In step S1130, if the accuracy of the artificial neural network is not higher than the target value, it may be determined that the accuracy of the artificial neural network has been degraded by the currently executed bit quantization. Accordingly, in this case, the minimum number of bits that satisfied the accuracy target value in the bit quantization performed immediately before may be determined as the final number of bits for the parameter of the selected layer, i.e., step S1140.

Next, it is determined whether bit quantization for all layers of the artificial neural network is completed, i.e., step S1150. In this step, if it is determined that bit quantization for all layers of the artificial neural network is completed, the entire process is terminated. On the other hand, if a layer that has not yet been bit quantized remains among the layers of the artificial neural network, step S1110 is executed to perform bit quantization for that layer.

Here, in step S1110, the method of selecting the other layer from the plurality of layers included in the artificial neural network may be performed according to: (i) a method of selecting the layer following the previously selected layer according to the arrangement order of the plurality of layers configuring the artificial neural network, i.e., “forward bit quantization”; (ii) a method of selecting the layer preceding the previously selected layer, in the backward direction according to the arrangement order of the plurality of layers configuring the artificial neural network, i.e., “backward bit quantization”; (iii) a method of selecting the layer with the next highest computational amount after the previously selected layer, according to the order of computational amount among the plurality of layers configuring the artificial neural network, i.e., “high computational cost bit quantization”; or (iv) a method of selecting the layer with the next lowest computational amount after the previously selected layer, according to the order of computational amount among the plurality of layers configuring the artificial neural network, i.e., “low computational cost bit quantization”.

In one embodiment, the accuracy of the artificial neural network may mean the probability that the artificial neural network will provide a correct solution to a problem in the inference stage, after learning a solution to a given problem, for example, recognizing an object included in an image given as input data. In addition, the target value used in the bit quantization method described above may represent the minimum accuracy to be maintained after bit quantization of the artificial neural network. For example, assuming that the target value is 90% accuracy, additional bit quantization may be performed as long as the accuracy of the artificial neural network remains at 90% or more after the parameter of the selected layer has been reduced in bit units. For example, after performing a first bit quantization, if the accuracy of the artificial neural network is measured to be 94%, additional bit quantization can be performed. If, after executing a second bit quantization, the accuracy of the artificial neural network is measured to be 88%, the result of the currently performed bit quantization is discarded, and the number of bits determined by the first bit quantization, i.e., the number of bits for representing the corresponding data, may be determined as the final bit quantization result.
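The following Python sketch restates the loop of steps S1110 to S1140 under stated assumptions: the `layers` objects with `set_bits` and `name`, the `evaluate_accuracy` routine, and the one-bit-at-a-time reduction from a 16-bit start are all hypothetical stand-ins, and the disclosure does not fix these details.

```python
def quantize_network_bits(layers, evaluate_accuracy, target=0.90, start_bits=16):
    """Greedy per-layer bit reduction: keep removing one bit while the
    network accuracy stays at or above the target; on failure, restore
    the last passing bit width and move on to the next layer."""
    final_bits = {}
    for layer in layers:                       # step S1110: select a layer
        bits = start_bits
        while bits > 1:
            layer.set_bits(bits - 1)           # step S1120: reduce representation
            if evaluate_accuracy() >= target:  # step S1130: accuracy check
                bits -= 1                      # keep the reduction, try again
            else:
                layer.set_bits(bits)           # step S1140: restore passing width
                break
        final_bits[layer.name] = bits
    return final_bits
```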

In one embodiment, in an artificial neural network including a plurality of layers, according to the computational cost bit quantization method, when selecting a layer to perform bit quantization based on the amount of computation among the plurality of layers, the computational amount of each layer may be determined as follows. That is, when one addition operation adds an n-bit operand and an m-bit operand in a specific layer of the artificial neural network, the amount of computation for that operation is calculated as (n+m)/2. In addition, when a specific layer of the artificial neural network multiplies n bits by m bits, the amount of computation for the corresponding operation may be calculated as n×m. Accordingly, the amount of computation of a specific layer of the artificial neural network may be the result of summing all the computation amounts of the additions and multiplications performed by that layer.
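As a worked example of this cost rule (with illustrative operation counts only):

    # Cost rule: an addition of n-bit and m-bit operands costs (n+m)/2,
    # a multiplication costs n*m; a layer's total is the sum over its ops.
    def add_cost(n, m):
        return (n + m) / 2

    def mul_cost(n, m):
        return n * m

    # e.g., a layer performing 1,000 multiplications and 1,000 additions
    # on 8-bit weights and 8-bit activations:
    layer_cost = 1000 * mul_cost(8, 8) + 1000 * add_cost(8, 8)
    print(layer_cost)  # 64000 + 8000 = 72000.0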

In addition, according to the computational cost bit quantization method, the method of performing bit quantization by selecting a layer from a plurality of layers based on the computational amount in an artificial neural network is not limited to that shown in FIG. 11, and various modifications are possible.

In another embodiment, bit quantization of a parameter for each layer in the embodiment shown in FIG. 11 may be performed separately for a weight and an activation map. For example, quantization is first performed on the weight of the selected layer, and as a result, the weight is represented with n bits. Separately, by performing bit quantization on the output activation data of the selected layer, the number of representation bits of the activation map data can be determined as m bits. Alternatively, quantization may be performed while allocating the same number of bits to the weight and the activation map data of the corresponding layer, and as a result, the same n bits may be used for both the weight and the activation map data.

FIG. 12 is a flowchart illustrating a bit quantization method of an artificial neural network according to another embodiment of the present disclosure.

As illustrated, the bit quantization method 1200 of an artificial neural network may start with selecting the layer with the highest computational amount among the plurality of layers included in the artificial neural network, step S1210.

When the layer selection of the artificial neural network is completed in step S1210, the operation may proceed to the step of reducing the size of the data representation for the parameter of the selected layer to a unit of bits, step S1220. In an embodiment, when the size of the data of the selected layer is reduced to a unit of bits, the weight kernel quantization 1028 and the activation map quantization 1024 described with reference to FIGS. 4 to 10 may be performed.

After the execution of step S1220, the step of determining whether the accuracy of the artificial neural network reflecting the bit quantization results so far is greater than or equal to a predetermined target value, step S1230, may proceed. If it is determined in step S1230 that the accuracy of the artificial neural network is greater than or equal to the target value, the size of the data of the corresponding layer is set as the current bit quantization result, and after proceeding to step S1210, steps S1210 to S1230 may be repeatedly executed. That is, by proceeding to step S1210, the computational amount is calculated again for all layers in the artificial neural network, and based on this, the layer with the highest computational amount is selected again.

In step S1230, if the accuracy of the artificial neural network is not higher than the target value, the bit reduction quantization for the currently selected layer is canceled, and that layer is excluded from the layers that can be selected in the layer selection step S1210. Then, the layer with the next highest computational amount may be selected, step S1240. Next, the size of the data of the newly selected layer may be reduced to a unit of bits, step S1250.

In step S1260, it is determined whether the accuracy of the artificial neural network reflecting the bit quantization results so far is greater than or equal to the target value. If the accuracy of the artificial neural network is not higher than the target value, it is determined whether bit quantization for all layers of the artificial neural network is completed, step S1270. If it is determined in step S1270 that bit quantization for all layers of the artificial neural network is completed, the entire bit quantization process is terminated. On the other hand, if it is determined in step S1270 that bit quantization for all layers of the artificial neural network has not been completed, the process may proceed to step S1240.

If it is determined in step S1260 that the accuracy of the artificial neural network is greater than or equal to the target value, the process proceeds to step S1220 to continue with the subsequent procedure.
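The procedure of FIG. 12 can be compressed into the following illustrative sketch, under the same assumed quantize and evaluate callables; the bits and cost attributes on each layer object are likewise hypothetical, with cost taken to reflect the layer's current bit widths.

    # Greedy sketch of FIG. 12: always re-rank layers by computational
    # amount, shave one bit off the most expensive eligible layer, and
    # exclude a layer once further reduction breaks the accuracy target.
    def greedy_high_cost_quantization(network, layers, quantize, evaluate, target):
        excluded = set()
        while len(excluded) < len(layers):
            eligible = [l for l in layers if l not in excluded]
            layer = max(eligible, key=lambda l: l.cost)  # steps S1210/S1240
            if layer.bits <= 1:
                excluded.add(layer)                      # nothing left to cut
                continue
            quantize(layer, layer.bits - 1)              # steps S1220/S1250
            if evaluate(network) >= target:              # steps S1230/S1260
                layer.bits -= 1                          # keep the reduction
            else:
                quantize(layer, layer.bits)              # cancel the reduction
                excluded.add(layer)                      # step S1270 bookkeeping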

FIG. 13 is a flowchart illustrating a bit quantization method of an artificial neural network having a plurality of layers according to yet another embodiment of the present disclosure.

As shown, the bit quantization method 1300 of an artificial neural network having a plurality of layers includes steps S1310 to S1350 of searching for an accuracy variation point for each of the layers included in the artificial neural network. The method 1300 starts by initially fixing the bit size of the data of all layers included in the artificial neural network to the maximum and selecting one layer for which the search for an accuracy variation point has not been performed, step S1310.

When the layer selection of the artificial neural network is completed in step S1310, the process may proceed to the step of reducing the size of the data of the selected layer to a unit of bits, step S1320. In an embodiment, when the size of the data of the selected layer is reduced to a unit of bits, the weight kernel quantization 1028 and the activation map quantization 1024 described with reference to FIGS. 4 to 10 may be performed.

After the execution of step S1320, the step of determining whether the accuracy of the artificial neural network reflecting the bit quantization results so far for the selected layer is greater than or equal to a predetermined target value, step S1330, may be performed. If it is determined in step S1330 that the accuracy of the artificial neural network is greater than or equal to the target value, the process proceeds to step S1320 to perform additional bit reduction quantization for the currently selected layer.

In step S1330, if the accuracy of the artificial neural network is not higher than the target value, the number of data bits of the currently selected layer is set to the minimum number of bits that most recently satisfied the target value. Thereafter, it is determined whether the search for the accuracy variation point has been completed for all layers of the artificial neural network, step S1340. In this step, if the search for the accuracy variation point has not been completed for all the layers, the process may proceed to step S1310. In step S1310, with the bit size of the data of all the layers included in the artificial neural network restored to the maximum, another layer for which the search for the accuracy variation point has not been performed is selected.

If it is determined in step S1340 that the search for the accuracy variation points for all layers of the artificial neural network has been completed, the bit quantization result corresponding to the accuracy variation point of each layer of the artificial neural network may be reflected in the artificial neural network, step S1350. In an embodiment, in step S1350, each layer is set to the bit size of the data immediately before the accuracy variation point of that layer, i.e., the point where the accuracy of the artificial neural network is degraded for that layer, as determined according to steps S1310 to S1340 described above.

In another embodiment, in step S1350, each layer is set to a size somewhat larger than that of the resource required for the computation of the parameter immediately before the accuracy variation point of each layer of the artificial neural network determined according to steps S1310 to S1340 described above. For example, the number of bits of the parameter of each layer of the artificial neural network may be set to be 2 bits larger than the number of bits immediately before the accuracy variation point. Then, the bit quantization method, step S1360, is performed on the artificial neural network having the data size of each layer set in step S1350. The bit quantization method executed in step S1360 may include, for example, the method shown in FIG. 11 or FIG. 12.
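For illustration, the two-phase procedure of FIG. 13 might be sketched as follows under the same assumed callables; the +2-bit margin corresponds to the example given above.

    # Phase 1 (S1310-S1340): find each layer's accuracy variation point in
    # isolation, with all other layers held at the maximum bit width.
    # Phase 2 (S1350): set every layer slightly above its variation point.
    def find_variation_points(network, layers, quantize, evaluate, target,
                              max_bits=16):
        points = {}
        for layer in layers:
            for l in layers:
                quantize(l, max_bits)       # reset all layers to the maximum
            bits = max_bits
            while bits > 1 and evaluate(network) >= target:
                bits -= 1
                quantize(layer, bits)       # step S1320
            points[layer] = bits + 1        # smallest width meeting the target
        for layer, bits in points.items():  # step S1350: apply with a margin
            quantize(layer, min(bits + 2, max_bits))
        return points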

The bit quantization method of the artificial neural network according to the various embodiments described above is not limited to being executed on the weight kernel and the feature map or activation map of each of the plurality of layers of the artificial neural network. In one embodiment of the bit quantization method of the present disclosure, bit quantization is first executed on the weight kernels or weights of all layers of the artificial neural network, and bit quantization may then be performed again on the feature maps of all the layers of the artificial neural network in which the weight kernel quantization is reflected. In another embodiment, bit quantization may be performed first on the feature maps of all layers of the artificial neural network, and bit quantization may then be performed again on the weight kernels of all layers of the artificial neural network in which the feature map quantization is reflected.

In addition, the bit quantization method of the artificial neural network of the present disclosure is not limited to applying the same degree of bit quantization to the weight kernels of each layer of the artificial neural network. In one embodiment of the bit quantization method of the present disclosure, bit quantization may be performed in units of the weight kernels of each layer of the artificial neural network, or individual bit quantization may be performed so that each weight that is an element of each weight kernel has a different number of bits.

Hereinafter, examples of execution results of the method for bit quantization of an artificial neural network according to various embodiments of the present disclosure will be described with reference to the drawings.

FIG. 14 is a graph showing an example of the amount of computation for each layer of an artificial neural network according to an embodiment of the present disclosure. The artificial neural network shown in FIG. 14 is an example of a convolutional artificial neural network of the VGG-16 model including 16 layers, and each layer of the artificial neural network has a different amount of computation.

For example, since the second layer, the fourth layer, the sixth layer, the seventh layer, the ninth layer, and the tenth layer have the highest amounts of computation, bit quantization may be applied to them first when the high computational cost bit quantization method is followed. In addition, after bit quantization for the second, fourth, sixth, seventh, ninth, and tenth layers is performed, bit quantization may be performed for the 14th layer, which has the next highest computational amount.

FIG. 15 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization has been performed by the forward bit quantization method according to an embodiment of the present disclosure.

As described above, forward quantization is a method of sequentially performing bit quantization from the earliest layer, for example, from the layer where input data is first received, based on the arrangement order of the plurality of layers included in the artificial neural network. FIG. 15 shows the number of bits for each layer after applying forward quantization to the artificial neural network of the VGG-16 model shown in FIG. 14, together with the reduction rate of the computational amount of the artificial neural network achieved by forward quantization. For example, when an addition of n bits and m bits is performed, the amount of computation for the corresponding operation is calculated as (n+m)/2. In addition, when a multiplication of n bits and m bits is performed, the amount of computation for the corresponding operation may be calculated as n×m. Accordingly, the total amount of computation of the artificial neural network may be the result of summing all the computation amounts of the additions and multiplications performed by the artificial neural network.
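For illustration, the total computational amount and a reduction rate of the kind quoted for FIGS. 15 to 18 can be reproduced with a toy calculation; the operation counts below are invented and are not the measured VGG-16 figures.

    # Total network cost under the rule above; each layer is described by
    # (additions, multiplications, weight_bits, activation_bits).
    def network_cost(layers):
        total = 0
        for adds, muls, n, m in layers:
            total += adds * (n + m) / 2 + muls * n * m
        return total

    baseline = network_cost([(1000, 1000, 16, 16)])  # toy one-layer network
    quantized = network_cost([(1000, 1000, 9, 16)])  # weights cut to 9 bits
    print(1 - quantized / baseline)  # ~0.42, i.e., a 42% reduction rate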

As shown, when bit quantization was performed on the artificial neural network of the VGG-16 model using forward quantization, the numbers of bits of the layers arranged toward the front of the artificial neural network were reduced relatively more, and the numbers of bits of the layers arranged toward the rear were reduced relatively less. For example, while the number of bits of the first layer of the artificial neural network was reduced to 12 bits and the numbers of bits of the second layer and the third layer were each reduced to 9 bits, the number of bits of the 16th layer decreased only to 13 bits, and the number of bits of the 15th layer decreased only to 15 bits. When forward quantization was applied sequentially from the first layer to the 16th layer of the artificial neural network in this way, the reduction rate of the total computational amount of the artificial neural network was calculated to be 56%.

FIG. 16 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization has been performed by the backward bit quantization method according to an embodiment of the present disclosure.

Backward quantization is a method of sequentially performing bit quantization from the last layer, for example, from the layer where output data is finally produced, based on the arrangement order of the plurality of layers included in the artificial neural network. FIG. 16 shows the number of bits for each layer after applying backward quantization to the artificial neural network of the VGG-16 model shown in FIG. 14, together with the reduction rate of the computational amount of the artificial neural network achieved by backward quantization.

As shown, when bit quantization was performed on the artificial neural network of the VGG-16 model using backward quantization, the numbers of bits of the layers arranged toward the rear of the artificial neural network were reduced relatively more, and the numbers of bits of the layers arranged toward the front were reduced relatively less. For example, the numbers of bits of the first layer, the second layer, and the third layer were each reduced to 15 bits, and the number of bits of the fourth layer was reduced to 14 bits, while the number of bits of the 16th layer was reduced to 9 bits and the number of bits of the 15th layer was reduced to 15 bits. When backward quantization was applied sequentially from the 16th layer back to the first layer of the artificial neural network in this way, the reduction rate of the total computational amount of the artificial neural network was calculated to be 43.05%.

FIG. 17 is a graph showing the number of bits for each layer of an artificial neural network in which bit quantization is performed by the high computational cost layer first bit quantization method according to an embodiment of the present disclosure.

High computation layer first quantization, or high computation quantization, is a method of sequentially performing bit quantization starting from the layer with the highest computational amount among the plurality of layers included in an artificial neural network. FIG. 17 shows the number of bits for each layer after applying high computation quantization to the artificial neural network of the VGG-16 model shown in FIG. 14, together with the reduction rate of the computational amount of the artificial neural network achieved by the high computation quantization.

As shown, when bit quantization is performed on the artificial neural network of the VGG-16 model using high computation quantization, the numbers of bits of the layers with high computational amounts among the plurality of layers of the artificial neural network are reduced relatively more. For example, the numbers of bits of the second layer and the tenth layer are reduced to 5 and 6 bits, respectively, while the number of bits of the first layer is reduced only to 14 bits. When high computation quantization was applied to the layers of the artificial neural network in order of computational amount in this way, the reduction rate of the computational amount of the entire artificial neural network was calculated to be 70.70%.

FIG. 18 is a graph showing the number of bits per layer of an artificial neural network in which bit quantization is performed by the low computational cost bit quantization method according to an embodiment of the present disclosure.

Low computation layer first quantization, or low computation quantization, is a method of sequentially performing bit quantization starting from the layer with the lowest computational amount among the plurality of layers included in an artificial neural network. FIG. 18 shows the number of bits for each layer after applying low computation quantization to the artificial neural network of the VGG-16 model shown in FIG. 14, together with the reduction rate of the computational amount of the artificial neural network achieved by the low computation quantization.

As shown, even when bit quantization is performed on the artificial neural network of the VGG-16 model using low computation quantization, the numbers of bits of the layers with high computational amounts among the plurality of layers of the artificial neural network are reduced relatively more. For example, the numbers of bits of the sixth layer and the seventh layer are reduced to 6 and 5 bits, respectively, while the number of bits of the first layer is reduced only to 13 bits. When low computation quantization was applied to the layers of the artificial neural network in order of computational amount in this way, the reduction rate of the computational amount of the entire artificial neural network was calculated to be 49.11%.

Hereinafter, hardware implementation examples of an artificial neural network to which bit quantization is applied according to the various embodiments of the present disclosure described above will be described in detail. When a convolutional artificial neural network including a plurality of layers is implemented in hardware, the weight kernel may be arranged outside and/or inside a processing unit for performing the convolution of the convolutional layers.

In one embodiment, the weight kernel may be stored in a memory, for example, a register, buffer, cache, or the like, separate from the processing unit for performing the convolution of the convolutional layer. In this case, after bit quantization is applied to the weight kernel to reduce the number of bits of the element values of the weight kernel, the size of the memory may be determined according to the number of bits of the weight kernel. In addition, the bit width of the multipliers or adders arranged in the processing unit, which perform multiplication and/or addition operations by receiving the element values of the weight kernel stored in the memory and the element values of the input feature map, may also be designed according to the number of bits resulting from the bit quantization.

In another embodiment, the weight kernel may be implemented in a hard-wired form in the processing unit for performing the convolution of the convolutional layer. In this case, after bit quantization is applied to the weight kernel to reduce the number of bits of the element values of the weight kernel, hard wires representing each of the element values of the weight kernel can be implemented in the processing unit according to the number of bits of the weight kernel. In addition, the bit size of the multipliers or adders arranged in the processing unit, which perform multiplication and/or addition operations by receiving the element values of the hard-wired weight kernel and the element values of the input feature map, may also be designed according to the number of bits resulting from the bit quantization.

FIGS. 19 to 21 described below are diagrams illustrating examples of hardware implementations of an artificial neural network including a plurality of layers according to other embodiments of the present disclosure. The method and system for bit quantization of an artificial neural network including a plurality of layers according to the present disclosure can reduce the required amount of computation, the bit size of the operators, and the memory by applying the present disclosure to any artificial neural network (ANN) computing system, such as a CPU, GPU, FPGA, or ASIC. In addition, although the present examples are shown based on integer operations, floating point operations may also be performed.

FIG. 19 is a diagram illustrating an example of a hardware implementation of an artificial neural network according to an embodiment of the present disclosure. The illustrated artificial neural network shows an example in which the convolutional multiplication processing apparatus 1900 of the convolutional layer of the convolutional artificial neural network is implemented in hardware. Here, the convolutional layer will be described on the assumption that convolution is performed by applying a weight kernel having a size of 3×3×3 to a part of the input feature map, i.e., data of a size of 3×3×3. The size and number of weight kernels of each layer may differ depending on the application field and the number of input/output feature map channels.

As illustrated, the weight kernel may be stored in a weight kernel cache 1910 that is separate from the processing unit 1930 for executing the convolution of the convolutional layer. In this case, after applying bit quantization to the weight kernel to reduce the number of bits of the element values (w1, w2, . . . , w9) of the weight kernel, the size of the cache can be determined according to the number of bits of the weight kernel. In addition, the bit size of the multipliers or adders arranged in the processing unit 1930, which receive the element values of the weight kernel stored in the memory and the element values of the input feature map and perform multiplication and/or addition operations, may also be designed according to the number of bits of the weight kernel element values resulting from the bit quantization.

According to an embodiment, the input feature map cache 1920 may receive and store a portion of the input data, i.e., a portion corresponding to the size of the weight kernel. The weight kernel traverses the input data, and the input feature map cache 1920 may sequentially receive and store the portion of the input data corresponding to the location of the weight kernel. The portion of the input data (x1, x2, . . . , x9) stored in the input feature map cache 1920 and the element values (w1, w2, . . . , w9) of the weight kernel stored in the weight kernel cache 1910 are respectively input to corresponding multipliers 1932 to perform elementwise multiplication. The result values of the elementwise multiplication by the multipliers 1932 are summed by the tree adder 1934 and input to the adder 1940. When the input data is composed of multiple channels, for example, when the input data is an RGB color image, the adder 1940 may add the value stored in the accumulator 1942, whose initial value is 0, to the sum for the current channel and store the result in the accumulator 1942 again. The sum value stored in the accumulator 1942 may then be added by the adder 1940 to the sum for the next channel and input to the accumulator 1942 once more. This summing process of the adder 1940 and the accumulator 1942 is performed for all channels of the input data, and the total sum value may be input to the output activation map cache 1950. The convolution procedure described above may be repeated for the weight kernel and the portion of the input data corresponding to the traversing position of the weight kernel on the input data.
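A behavioral software analogue of this dataflow, illustrative only and not an RTL description, is sketched below; the comments map each step to the numbered blocks of FIG. 19.

    # Per-position convolution as in FIG. 19: elementwise multiplies, a
    # tree-style sum per channel, and accumulation over input channels.
    def convolve_position(weight_kernel, input_patch):
        """weight_kernel, input_patch: lists of channels, each a list of 9 values."""
        accumulator = 0                         # accumulator 1942, initially 0
        for w_ch, x_ch in zip(weight_kernel, input_patch):
            products = [w * x for w, x in zip(w_ch, x_ch)]  # multipliers 1932
            channel_sum = sum(products)                     # tree adder 1934
            accumulator += channel_sum                      # adder 1940 + 1942
        return accumulator                      # to output activation map cache 1950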

As described above, when the element values of the weight kernel are stored in the weight kernel cache 1910 arranged outside the processing unit 1930, the number of bits of the weight kernel element values can be reduced by bit quantization according to the present disclosure. Accordingly, the size of the weight kernel cache 1910 and the sizes of the multipliers and adders of the processing unit 1930 can be reduced. Further, as the size of the processing unit 1930 decreases, the computation time and power consumption of the processing unit 1930 may also decrease.

FIG. 20 is a diagram illustrating an example of a hardware implementation of an artificial neural network according to another embodiment of the present disclosure.

The illustrated artificial neural network shows an example of implementing the convolutional multiplication processing apparatus 2000 of the convolutional layer of the convolutional artificial neural network in hardware. Here, the convolutional layer performs convolution by applying a weight kernel having a size of 3×3×3 to a portion of the input activation map, i.e., data of a size of 3×3×3.

As shown, the weight kernel may be stored in a weight kernel cache 2010 separate from the processing unit 2030 for executing the convolution of the convolutional layer. In this case, after bit quantization is applied to the weight kernel to reduce the number of bits of the element values (w1, w2, . . . , w27) of the weight kernel, the size of the cache may be determined according to the number of bits of the weight kernel. In addition, the bit size of the multipliers or adders arranged in the processing unit 2030, which receive the element values of the weight kernel stored in the memory and the element values of the input activation map or feature map and perform multiplication and/or addition operations, may also be designed according to the number of bits of the weight kernel element values resulting from the bit quantization.

According to an embodiment, the input activation map cache 2020 may receive and store a portion of the input data composed of multiple channels, e.g., three RGB channels, i.e., a portion corresponding to the size of the weight kernel. The weight kernel traverses the input data, and the input activation map cache 2020 may sequentially receive and store the portion of the input data corresponding to the location of the weight kernel. The portion of the input data (x1, x2, . . . , x27) stored in the input activation map cache 2020 and the element values (w1, w2, . . . , w27) of the weight kernel stored in the weight kernel cache 2010 are each input to corresponding multipliers to perform elementwise multiplication. At this time, the weight kernel element values (w1, w2, . . . , w9) of the weight kernel cache 2010 and the portion of the first channel of the input data (x1, x2, . . . , x9) stored in the input activation map cache 2020 are input to the first convolution processing unit 2032. In addition, the weight kernel element values (w10, w11, . . . , w18) of the weight kernel cache 2010 and the portion of the second channel of the input data (x10, x11, . . . , x18) stored in the input activation map cache 2020 are input to the second convolution processing unit 2034. In addition, the weight kernel element values (w19, w20, . . . , w27) of the weight kernel cache 2010 and the portion of the third channel of the input data (x19, x20, . . . , x27) stored in the input activation map cache 2020 are input to the third convolution processing unit 2036.

Each of the first convolution processing unit 2032, the second convolution processing unit 2034, and the third convolution processing unit 2036 may operate in the same manner as the processing unit 1930 illustrated in FIG. 19. The result values of the convolutions calculated by the first convolution processing unit 2032, the second convolution processing unit 2034, and the third convolution processing unit 2036 may be summed by the tree adder 2038 and input to the output activation map cache 2040.

As described above, when the element values of the weight kernel are stored in the weight kernel cache 2010 arranged outside the processing unit 2030, the number of bits of the weight kernel element values may be reduced by bit quantization according to the present disclosure. Accordingly, the size of the weight kernel cache 2010 and the sizes of the multipliers and adders of the processing unit 2030 can be reduced. Further, as the size of the processing unit 2030 decreases, the computation time and power consumption of the processing unit 2030 may also decrease.

FIG. 21 is a diagram illustrating an example of a hardware implementation of an artificial neural network according to yet another embodiment of the present disclosure.

The illustrated artificial neural network shows an example of implementing the convolutional multiplication processing apparatus 2200 of the convolutional layer of the convolutional artificial neural network in hardware. Here, the convolutional layer performs convolution by applying a weight kernel having a size of 3×3×3 to a portion of the input activation map, i.e., data of a size of 3×3×3.

As shown, the weight kernel may be implemented in a hard-wired form in the processing unit 2220 for executing the convolution of the convolutional layer. In this case, after bit quantization is applied to the weight kernel to reduce the number of bits of the element values (w1_K, w2_K, . . . , w27_K) of the weight kernel, the number of wires may be determined according to the number of bits of the weight kernel. In addition, the bit size of the multipliers or adders arranged in the processing unit 2220, which receive the element values of the weight kernel implemented as wires in the processing unit 2220 and the element values of the input activation map or feature map and perform multiplication and/or addition operations, may also be designed according to the number of bits of the weight kernel element values resulting from the bit quantization.

According to an embodiment, the input activation map cache 2210 may receive and store a portion of the input data composed of multiple channels, e.g., three RGB channels, i.e., a portion corresponding to the size of the weight kernel. The weight kernel traverses the input data, and the input activation map cache 2210 may sequentially receive and store the portion of the input data corresponding to the location of the weight kernel. The portion of the input data (x1, x2, . . . , x27) stored in the input activation map cache 2210 and the element values (w1_K, w2_K, . . . , w27_K) of the weight kernel implemented as wires in the processing unit 2220 are respectively input to corresponding multipliers to perform elementwise multiplication. In this case, the weight kernel element values (w1_K, w2_K, . . . , w9_K) implemented as wires in the processing unit 2220 and the portion of the first channel of the input data (x1, x2, . . . , x9) stored in the input activation map cache 2210 are input to the first convolution processing unit 2222. In addition, the weight kernel element values (w10_K, w11_K, . . . , w18_K) implemented as wires in the processing unit 2220 and the portion of the second channel of the input data (x10, x11, . . . , x18) stored in the input activation map cache 2210 are input to the second convolution processing unit 2224. In addition, the weight kernel element values (w19_K, w20_K, . . . , w27_K) implemented as wires in the processing unit 2220 and the portion of the third channel of the input data (x19, x20, . . . , x27) stored in the input activation map cache 2210 are input to the third convolution processing unit 2226.

The result values of the convolutions calculated by the first convolution processing unit 2222, the second convolution processing unit 2224, and the third convolution processing unit 2226 may be summed by the tree adder 2228 and input to the output activation map cache 2230.

As described above, when the element values of the weight kernel are implemented in a hard-wired form in the processing unit 2220, the number of bits of the weight kernel element values may be reduced by bit quantization according to the present disclosure. Accordingly, the number of wires implemented therein and the sizes of the multipliers and adders of the processing unit 2220 can be reduced. Also, as the size of the processing unit 2220 decreases, the computation time and power consumption of the processing unit 2220 may also decrease.

FIG. 22 is a diagram illustrating the configuration of a system for performing bit quantization on an artificial neural network according to an embodiment of the present disclosure.

As shown, the system 2300 may include a parameter selection module 2310, a bit quantization module 2320, and an accuracy determination module 2330. The parameter selection module 2310 may analyze the configuration information of the input artificial neural network. The configuration information of the artificial neural network may include the number of layers included in the artificial neural network, the function and role of each layer, information about the input/output data of each layer, the type and number of multiplications and additions performed by each layer, the type of activation function executed by each layer, the type and configuration of the weight kernel input to each layer, the size and number of weight kernels in each layer, the size of the output feature map, the initial values of the weight kernel, e.g., arbitrarily initialized element values of the weight kernel, and the like, but is not limited thereto. The configuration information of the artificial neural network may include information on various elements according to the type of the artificial neural network, e.g., a convolutional artificial neural network, a recurrent artificial neural network, a multilayer perceptron, and the like.

The parameter selection module 2310 may select at least one parameter or parameter group to be quantized from the artificial neural network with reference to the input artificial neural network configuration information. How one parameter, one piece of data, or one parameter group is selected in the artificial neural network may be determined according to the influence of the candidate parameter on the overall performance of the artificial neural network, on the amount of computation, or on the amount of resources required for its hardware implementation. The selection of a parameter may be performed by selecting one among one weight, one feature map or activation map, one weight kernel, all weights in one layer, or all feature maps or activation maps in one layer.

In an embodiment, in the case of the convolutional artificial neural network (CNN) 400 described with reference to FIGS. 4 to 10 above, since the convolutional layer 420 and/or the fully connected layer 440 has a large effect on the overall performance or computational amount of the CNN 400, the weight kernel or the feature map/activation map of at least one of these layers 420 and 440 may be selected as the parameter to be quantized.

In an embodiment, at least one of the plurality of layers included in the artificial neural network may be selected, and all weight kernels in that layer or all activation map data of that layer may be set as one parameter group. The selection method may be determined according to the influence of the selected layer on the overall performance or computational amount of the artificial neural network, but is not limited thereto and may include any one of various methods. For example, the selection of at least one layer among the plurality of layers included in the artificial neural network may be executed according to (i) a method of sequentially selecting layers from the first layer, at which the input data is received, toward subsequent layers according to the arrangement order of the plurality of layers configuring the artificial neural network; (ii) a method of sequentially selecting layers from the last layer, at which the final output data is generated, toward preceding layers according to the arrangement order of the plurality of layers configuring the artificial neural network; (iii) a method of selecting layers starting from the layer with the highest computational amount among the plurality of layers configuring the artificial neural network; or (iv) a method of selecting layers starting from the layer with the lowest computational amount among the plurality of layers configuring the artificial neural network.

When the selection of the target data for quantization of the artificial neural network is completed by the parameter selection module 2310, information on the selected data is input to the bit quantization module 2320. The bit quantization module 2320 may reduce the data representation size for the corresponding parameter to a unit of bits by referring to the input information of the selected parameter. The resources required for the operation of the selected parameter may include a memory for storing the selected parameter or a data path for transmitting the selected parameter, but are not limited thereto.

In an embodiment, when the bit quantization module 2320 reduces the data size of the selected parameter to a unit of bits, the weight kernel quantization and/or activation map quantization described with reference to FIGS. 4 to 13 may be performed.

When the bit quantization module 2320 completes bit quantization for the selected parameter, it transmits the bit quantized artificial neural network information to the accuracy determination module 2330. The accuracy determination module 2330 may reflect the bit quantized artificial neural network information in the configuration information of the artificial neural network input to the system 2300. The accuracy determination module 2330 may then determine whether the accuracy of the artificial neural network is greater than or equal to a predetermined target value based on the configuration information of the artificial neural network in which the bit quantized artificial neural network information is reflected. For example, after the data representation size of the parameter selected in the artificial neural network is decreased to a unit of bits, if the accuracy of the output result of the artificial neural network, for example, the inference result of the artificial neural network, is greater than or equal to the predetermined target value, the accuracy determination module 2330 may predict that the overall performance of the artificial neural network can be maintained even when additional bit quantization is performed.

Therefore, when the accuracy determination module 2330 determines that the accuracy of the artificial neural network is greater than or equal to the target value, a control signal is transmitted to the parameter selection module 2310 so that the parameter selection module 2310 selects another parameter or parameter group included in the artificial neural network. Here, the method of selecting one parameter in the artificial neural network may be executed according to (i) a method of sequentially selecting the parameter following the previously selected parameter according to the arrangement order of the parameters or parameter groups configuring the artificial neural network (“forward bit quantization”); (ii) a method of selecting the parameter preceding the previously selected parameter, moving backward according to the arrangement order of the parameters or parameter groups configuring the artificial neural network (“backward bit quantization”); (iii) a method of selecting the parameter with the next highest amount of computation after the previously selected parameter, in order of the amount of computation among the plurality of parameters configuring the artificial neural network (“high computational cost bit quantization”); or (iv) a method of selecting the parameter with the next lowest amount of computation after the previously selected parameter, in order of the amount of computation among the plurality of parameters configuring the artificial neural network (“low computational cost bit quantization”).

On the other hand, if the accuracy determination module 2330 determines that the accuracy of the artificial neural network is not greater than or equal to the target value, it may determine that the accuracy of the artificial neural network is degraded due to the bit quantization performed on the currently selected parameter. Therefore, in this case, the number of bits determined by the bit quantization performed immediately before may be determined as the final number of bits. In one embodiment, the accuracy of the artificial neural network may mean the probability that the artificial neural network will present the correct answer to the problem in the inference stage after learning a solution to a given problem, for example, recognition of an object included in an image given as input data. In addition, the target value used in the bit quantization method described above may represent the minimum accuracy to be maintained after bit quantization of the artificial neural network. For example, assuming that the target value is 90%, additional bit quantization can be performed as long as the accuracy of the artificial neural network remains 90% or more even after reducing the memory size for storing the parameters of the selected layer to a unit of bits. For example, after performing a first bit quantization, if the accuracy of the artificial neural network is measured to be 94%, additional bit quantization can be performed. If, after performing a second bit quantization, the accuracy of the artificial neural network is measured to be 88%, the result of the currently executed bit quantization may be discarded, and the number of data representation bits determined by the first bit quantization can be determined as the final bit quantization result.
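For illustration, the interaction of the three modules of FIG. 22 might be sketched as follows; select_next and quantize_one_bit are assumed callables, and the bits attribute on a parameter is likewise hypothetical.

    # Module-level sketch of the system 2300: the selection module picks a
    # parameter, the quantization module removes one bit at a time, and the
    # accuracy module decides whether to continue or roll back and move on.
    def run_system(network, select_next, quantize_one_bit, evaluate, target):
        param = select_next(network, previous=None)   # parameter selection 2310
        while param is not None:
            previous_bits = param.bits
            quantize_one_bit(param)                   # bit quantization 2320
            if evaluate(network) >= target:           # accuracy module 2330
                continue                              # keep shrinking this one
            param.bits = previous_bits                # stand-in for restoring
                                                      # the prior quantization
            param = select_next(network, previous=param)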

In one embodiment, according to the computational cost bit quantization method, when selecting a parameter or parameter group to perform bit quantization based on the amount of computation, the amount of computation of each parameter may be determined as follows. That is, when an addition of n bits and m bits is performed in a specific operation of the artificial neural network, the amount of computation of the corresponding operation is calculated as (n+m)/2. In addition, when n bits are multiplied by m bits in a specific operation of the artificial neural network, the amount of computation for the corresponding operation may be calculated as n×m. Accordingly, the amount of computation for a specific parameter of the artificial neural network may be the result of summing all the computation amounts of the additions and multiplications performed on that parameter.

In this bit quantization, a specific parameter or parameter group may be selected as the weight data or the feature map and activation map data belonging to each layer, as each weight kernel belonging to one layer, or as an individual group of weights within one weight kernel.

For reference, the elements shown in FIG. 22 according to an embodiment of the present disclosure may be implemented as software or as hardware elements such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC).

However, the ‘elements’ are not meant to be limited to software or hardware, and each element may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors.

Thus, as an example, an element includes elements such as software elements, object-oriented software elements, class elements, and task elements, as well as processes, functions, properties, procedures, sub-routines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The elements and the functions provided within the elements can be combined into a smaller number of elements or further divided into additional elements.

Embodiments of the present disclosure may also be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. A computer-readable medium can be any available medium that can be accessed by a computer, and includes both volatile and nonvolatile media, and removable and non-removable media. Further, the computer-readable medium may include both computer storage media and communication media. Computer storage media include both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Communication media typically include computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or another transmission mechanism, and include any information delivery media.

Although the present disclosure has been described in connection with some embodiments herein, it should be understood that various modifications and changes can be made without departing from the scope of the present disclosure as understood by those skilled in the art to which the present disclosure belongs. In addition, such modifications and changes should be considered to fall within the scope of the claims appended to this specification.

CLAIMS

1. A hardware for an artificial neural network, comprising: a memory configured to store element values of a weight kernel or element values of a feature map of the artificial neural network in which all parameters or all parameter groups are quantized, wherein the artificial neural network is quantized by: selecting at least one parameter from the parameters or at least one parameter group from the parameter groups, executing bit quantization to reduce a size of a data representation for the selected at least one parameter or the selected at least one parameter group to a unit of bits, determining whether accuracy of the artificial neural network according to the bit quantization applied to the selected at least one parameter or the selected at least one parameter group is greater than or equal to a target value, and responsive to the accuracy of the artificial neural network being greater than or equal to the target value, repeatedly executing the bit quantization; and a processing unit for performing convolution, including a plurality of multipliers or a plurality of adders designed to have a bit size corresponding to a number of quantization bits of the quantized artificial neural network.

2. The hardware of claim 1, wherein in the quantized artificial neural network, the at least one parameter or the at least one parameter group is sequentially quantized based on an amount of computation or an amount of memory.

3. The hardware of claim 1, wherein the processing unit is further configured to process the quantized artificial neural network by at least one of a computational cost bit quantization method, a forward bit quantization method, or a backward bit quantization method.

4. The hardware of claim 2, wherein the amount of computation and the amount of memory of the quantized artificial neural network are relatively reduced compared to those before quantization, and a number of bits of each data of the at least one parameter or the at least one parameter group stored in the memory is reduced.

5. The hardware of claim 1, wherein the memory includes at least one of a buffer memory, a register memory, or a cache memory.

6. The hardware of claim 1, wherein the quantized artificial neural network includes a plurality of layers, wherein a size of a data bit of a data path through which data of a specific layer among the plurality of layers is transmitted is reduced in a unit of bits.

7. The hardware of claim 1, wherein in the quantized artificial neural network, bit quantization is executed to reduce a storage size of the memory configured to store the at least one parameter or the at least one parameter group.

8. The hardware of claim 1, wherein the memory further includes at least one of a weight kernel cache or an input feature map cache.

9. The hardware of claim 1, wherein the processing unit further includes a tree adder configured to sum result values of elementwise multiplication by the plurality of multipliers.

10. The hardware of claim 1, further comprising: an adder connected to the processing unit and an accumulator connected to the adder.

11. The hardware of claim 1, further comprising: an output activation map cache configured to store a result value of convolution of the processing unit.

12. The hardware of claim 1, wherein the processing unit further includes a plurality of convolution processing units.

13. The hardware of claim 12, wherein the processing unit further includes a tree adder configured to sum result values of convolution of each of the plurality of convolution processing units.

14. A method for quantizing bits of a multi-layered artificial neural network having a plurality of layers, executed by a system, the method comprising: selecting at least one layer from among the plurality of layers in an order of having a large amount of memory or a small amount of memory; bit quantizing to reduce a size of a data representation for a parameter of the selected layer to a unit of bits; determining whether accuracy of the multi-layered artificial neural network after the bit quantizing is greater than or equal to a target value; and executing the bit quantizing responsive to an accuracy of the artificial neural network being greater than or equal to the target value.

15. The method of claim 14, further comprising: determining the size of the data representation for the parameter of the selected layer that satisfies the accuracy greater than or equal to the target value as a final number of bits for the parameter of the selected layer, responsive to the accuracy of the artificial neural network being less than the target value.

16. The method of claim 15, further comprising: selecting at least one layer in which the final number of bits for the parameter is not determined among the plurality of layers, and repeatedly executing the bit quantizing to determine the final number of bits of the selected at least one layer in which the final number of bits is not determined.

17. The method of claim 14, wherein the bit quantizing to reduce the size to the unit of bits is configured to reduce the size to a unit of 1 bit.

18. The method of claim 14, wherein the parameter of the selected at least one layer includes at least one of weight data, feature map data, or activation map data.

19. The method of claim 14, wherein a number of bits of a multiplier and an adder of the processing unit for processing the multi-layered artificial neural network, the bits of which are quantized, is designed to correspond to the number of bits according to a result of the bit quantizing.

20. A convolutional multiplication processing apparatus, comprising: a memory configured to store a quantized artificial neural network, a weight kernel and a feature map of the quantized artificial neural network, wherein the quantized artificial neural network is quantized by: selecting a parameter of at least one layer, executing bit quantization to reduce a size of a data representation for the selected parameter of the at least one layer to a unit of bits, determining whether accuracy of the artificial neural network according to the bit quantization applied to the selected parameter of the at least one layer or at least one parameter group is greater than or equal to a target value, and responsive to the accuracy of the artificial neural network being greater than or equal to the target value, repeatedly executing the bit quantization; and a plurality of multipliers or a plurality of adders configured to process convolution by receiving the weight kernel and the feature map of the quantized artificial neural network, and designed to have a bit size corresponding to a number of quantization bits of the quantized artificial neural network.